Abstract: We consider the problem of noisy 1-bit matrix
completion under an exact rank constraint on the true underlying 
matrix M*. Instead of observing a subset of the noisy continuous- 
valued entries of a matrix M*, we observe a subset of noisy 1-bit 
(or binary) measurements generated according to a probabilistic 
model. We consider constrained maximum likelihood estimation 
of M*, under a constraint on the entry-wise infinity-norm of 
M* and an exact rank constraint. This is in contrast to previous 
work which has used convex relaxations for the rank. We provide 
an upper bound on the matrix estimation error under this 
model. Compared to the existing results, our bound has faster 
convergence rate with matrix dimensions when the fraction of 
revealed 1-bit observations is fixed, independent of the matrix 
dimensions. We also propose an iterative algorithm for solving 
our nonconvex optimization with a certificate of global optimality 
of the limiting point. This algorithm is based on low rank 
factorization of M*. We validate the method on synthetic and 
real data with improved performance over existing methods. 


I. INTRODUCTION 

The problem of recovering a low rank matrix from an 
incomplete or noisy sampling of its entries arises in a variety 
of applications, including collaborative filtering [1] and sensor 
network localization [2], [3]. In many applications, the observations
are not only missing, but are also highly discretized,
e.g. binary-valued (1-bit) [4], [5], or multiple-valued [6]. For
example, in the Netflix problem where a subset of the users’ 
ratings is observed, the ratings take integer values between 
1 and 5. Although one can apply existing matrix completion 
techniques to discrete-valued observations by treating them as 
continuous-valued, performance can be improved by treating 
the values as discrete [4]. 

In this paper we consider the problem of completing a
matrix from a subset of its entries, where instead of observing
continuous-valued entries, we observe a subset of 1-bit
measurements. Given $M^* \in \mathbb{R}^{m \times n}$, a subset of indices
$\Omega \subseteq [m] \times [n]$, and a twice differentiable function $f : \mathbb{R} \to [0,1]$,
we observe ("w.p." stands for "with probability")

$$Y_{ij} = \begin{cases} +1 & \text{w.p. } f(M^*_{ij}), \\ -1 & \text{w.p. } 1 - f(M^*_{ij}), \end{cases} \qquad \text{for } (i,j) \in \Omega. \tag{1}$$


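One important application is the binary quantization of $Y_{ij} = M^*_{ij} + Z_{ij}$,
where $Z$ is a noise matrix with i.i.d. entries. If we
take $f$ to be the cumulative distribution function of $-Z_{11}$, then
the model in (1) is equivalent to observing

$$Y_{ij} = \begin{cases} +1 & \text{if } M^*_{ij} + Z_{ij} > 0, \\ -1 & \text{if } M^*_{ij} + Z_{ij} < 0, \end{cases} \qquad \text{for } (i,j) \in \Omega. \tag{2}$$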
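The equivalence between (1) and (2) can be checked by simulation. The sketch below is illustrative only: it assumes Gaussian noise with standard deviation `sigma` (so that the link is the probit function), and the function names, the seed, and the values of `sigma` and `m_star` are hypothetical choices rather than anything fixed by the paper.

```python
import math
import random

def probit_link(x, sigma):
    """f(x) = Phi(x / sigma): the CDF of -Z when Z ~ N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def sample_via_model_1(m_star, f, rng):
    """Draw Y = +1 with probability f(M*_ij), else -1, as in (1)."""
    return 1 if rng.random() < f(m_star) else -1

def sample_via_model_2(m_star, sigma, rng):
    """Quantize M*_ij + Z_ij with Z_ij ~ N(0, sigma^2), as in (2)."""
    return 1 if m_star + rng.gauss(0.0, sigma) > 0 else -1

rng = random.Random(0)                  # fixed seed, illustrative only
sigma, m_star, trials = 0.5, 0.3, 20000
link = lambda x: probit_link(x, sigma)

# Empirical frequency of Y = +1 under each sampling scheme.
freq_1 = sum(sample_via_model_1(m_star, link, rng) == 1 for _ in range(trials)) / trials
freq_2 = sum(sample_via_model_2(m_star, sigma, rng) == 1 for _ in range(trials)) / trials
target = probit_link(m_star, sigma)     # P(Y = +1) predicted by both models
```

Both empirical frequencies should agree with `target` up to Monte Carlo error, which is the sense in which the two observation models coincide.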


Recent work in the 1-bit matrix completion literature has 
followed the probabilistic model in (1)-(2) for the observed
matrix Y and has estimated M* via solving a constrained
maximum likelihood (ML) optimization problem. Under the
assumption that M* is low-rank, these works have used convex 
relaxations for the rank via the trace norm [4] or max-norm 
[5]. An upper bound on the matrix estimation error is given 
under the assumptions that the entries are sampled according 
to a uniform distribution [4], or in [5], following a non-uniform 
distribution. 

In this paper, we follow [4], [5] in seeking an ML estimate 
of M* but use an exact rank constraint on M* rather than a 
convex relaxation for the rank. We follow the sampling model 
of [7] for $\Omega$, which includes the uniform sampling of [4] as
well as non-uniform sampling. We provide an upper bound
on the Frobenius norm of the matrix estimation error, and show
that our bound yields a faster convergence rate with matrix
dimensions than the existing results of [4], [5] when the
fraction of revealed 1-bit observations is fixed independent
of the matrix dimensions. Lastly, we present an iterative
algorithm for solving our nonconvex optimization problem 
with a certificate of global optimality under mild conditions. 
Our algorithm outperforms [4], [5] in the presented simulation 
example. 

Notation: For a matrix $A$ with $(i,j)$-th entry $A_{ij}$, we use the
notation $\|A\|_\infty = \max_{i,j} |A_{ij}|$ for the entry-wise infinity-norm,
$\|A\|_F$ for the Frobenius norm and $\|A\|_2$ for its operator norm.
We use $A_{i,\cdot}$ to denote the $i$-th row and $A_{\cdot,j}$ to denote the $j$-th
column. Taking $S$ to be a set, we use $|S|$ to denote the
cardinality of $S$. The notation $[n]$ represents the set of integers
$\{1, \ldots, n\}$. We denote by $1_n \in \mathbb{R}^n$ the vector of all ones, by
$\bar{1}_n$ the unit vector $1_n/\sqrt{n}$, and by $\mathbb{I}_{(p)}$ the indicator function,
i.e. $\mathbb{I}_{(p)} = 1$ when $p$ is true, else $\mathbb{I}_{(p)} = 0$.

II. MODEL ASSUMPTIONS 

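We wish to estimate unknown $M^*$ using a constrained ML
approach. We use $M \in \mathbb{R}^{m \times n}$ to denote the optimization
variable. Then the negative log-likelihood function for the
given problem is

$$F_{\Omega,Y}(M) = -\sum_{(i,j)\in\Omega} \Big\{ \mathbb{I}_{(Y_{ij}=1)} \log\big(f(M_{ij})\big) + \mathbb{I}_{(Y_{ij}=-1)} \log\big(1 - f(M_{ij})\big) \Big\}. \tag{3}$$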

Note that (3) is a convex function of $M$ when the function $f$
is log-concave. Two common choices for which the function
$f$ is log-concave are: (i) Logit model with logistic function
$f(x) = 1/(1 + e^{-x/\sigma})$ and parameter $\sigma > 0$, or equivalently
$Z_{ij}$ in (2) is logistic with scale parameter $\sigma$; (ii) Probit model
with $f(x) = \Phi(x/\sigma)$ where $\sigma > 0$ and $\Phi(x)$ is the cumulative
distribution function of $\mathcal{N}(0, 1)$. We assume that $M^*$ is a low-rank
matrix with rank bounded by $r$, and that the true matrix
$M^*$ satisfies $\|M^*\|_\infty \le \alpha$, which helps make the recovery
of $M^*$ well-posed by preventing excessive "spikiness" of the
matrix. We refer the reader to [4], [5] for further details.
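To make the objective in (3) concrete, the following is a minimal sketch that evaluates the negative log-likelihood for the logit model on a tiny example. The helper names, the toy matrices and the choice $\sigma = 1$ are assumptions made for illustration; they are not taken from the paper.

```python
import math

def logit_link(x, sigma=1.0):
    """Logistic link f(x) = 1 / (1 + exp(-x / sigma))."""
    return 1.0 / (1.0 + math.exp(-x / sigma))

def neg_log_likelihood(M, Y, omega, f):
    """Evaluate (3): minus the log-likelihood summed over observed indices."""
    total = 0.0
    for (i, j) in omega:
        p = f(M[i][j])
        # Y_ij = +1 contributes log f(M_ij); Y_ij = -1 contributes log(1 - f(M_ij)).
        total -= math.log(p) if Y[i][j] == 1 else math.log(1.0 - p)
    return total

# Toy 2 x 2 example with three observed entries (hypothetical values).
M = [[0.0, 0.5], [-0.5, 0.2]]
Y = [[1, -1], [-1, 1]]
omega = [(0, 0), (0, 1), (1, 0)]
value = neg_log_likelihood(M, Y, omega, logit_link)
```

Because the logistic link is log-concave, evaluating this function at the midpoint of two matrices should never exceed the average of the two endpoint values, which is the convexity property noted above.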

The constrained ML estimate of interest is the solution to
the optimization problem (s.t.: subject to):

$$\hat{M} = \arg\min_{M} F_{\Omega,Y}(M) \quad \text{s.t. } \|M\|_\infty \le \alpha, \ \operatorname{rank}(M) \le r. \tag{4}$$

In many applications, such as sensor network localization, 
collaborative filtering, or DNA haplotype assembly, the rank r 
is known or can be reliably estimated [8]. 

We now discuss our assumptions on the set $\Omega$. Consider
a bipartite graph $G = ([m], [n], E)$, where the edge set $E \subseteq [m] \times [n]$
is related to the index set of revealed entries $\Omega$ as
$(i,j) \in E$ iff $(i,j) \in \Omega$. Abusing the notation, we use $G$ for
both the graph and its bi-adjacency matrix where $G_{ij} = 1$ if
$(i,j) \in E$, $G_{ij} = 0$ if $(i,j) \notin E$. We refer to $G$ as the graph
associated with $\Omega$. Without loss of generality we take $m \ge n$.
We assume that each row of $G$ has $d$ nonzero entries (thus
$|\Omega| = md$) with the following properties on its SVD:

(A1) The left and right top singular vectors of $G$ are
$1_m/\sqrt{m}$ and $1_n/\sqrt{n}$, respectively. This implies that
$\sigma_1(G) = d\sqrt{m/n} \ge d$, where $\sigma_1(G)$ denotes the
largest singular value of $G$, and that each column of
$G$ has $(md/n)$ nonzero entries.

(A2) We have $\sigma_2(G) \le C\sqrt{d}$, where $\sigma_2(G)$ denotes the
second largest singular value of $G$ and $C > 0$ is some
universal constant.


Thus we require $G$ to have a large enough spectral gap. As
discussed in [7], an Erdős-Rényi random graph with average
degree $d \ge c\log(m)$ satisfies this spectral gap property
with high probability, and so do stochastic block models for
certain choices of inter- and intra-cluster edge connection
probabilities. Thus, this sampling scheme is more general
than the uniform sampling assumption used in [4], and it
also includes the stochastic block model [7], resulting in
non-uniform sampling.
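Assumptions (A1) and (A2) can be inspected numerically on a small sampling pattern. The sketch below is a pure-Python illustration under stated assumptions: the circulant pattern `G` (with $m = 6$, $n = 3$, $d = 2$) is a hypothetical example, and the routine assumes (A1) already holds, so that $1_n/\sqrt{n}$ is the top right singular vector and only $\sigma_2(G)$ needs a power iteration on the deflated Gram matrix.

```python
import math

def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def top_two_singular_values(G, iters=200):
    """Return (sigma_1, sigma_2) of G, assuming (A1) holds for G."""
    Gt = [list(col) for col in zip(*G)]
    n = len(G[0])
    gram = lambda x: mat_vec(Gt, mat_vec(G, x))       # x -> G^T G x
    v1 = [1.0 / math.sqrt(n)] * n                      # top right singular vector under (A1)
    s1 = math.sqrt(dot(v1, gram(v1)))
    x = [float(j + 1) for j in range(n)]               # arbitrary starting vector
    for _ in range(iters):
        c = dot(x, v1)
        x = [xi - c * vi for xi, vi in zip(x, v1)]     # remove the v1 component
        y = gram(x)
        norm = math.sqrt(dot(y, y))
        if norm == 0.0:                                # G has rank one
            return s1, 0.0
        x = [yi / norm for yi in y]
    c = dot(x, v1)
    x = [xi - c * vi for xi, vi in zip(x, v1)]
    return s1, math.sqrt(dot(x, gram(x)) / dot(x, x))

# Hypothetical pattern: row i observes columns i mod n and (i + 1) mod n.
m, n, d = 6, 3, 2
G = [[1 if j in (i % n, (i + 1) % n) else 0 for j in range(n)] for i in range(m)]
row_sums = [sum(row) for row in G]                     # each should equal d
col_sums = [sum(col) for col in zip(*G)]               # each should equal m * d / n
sigma_1, sigma_2 = top_two_singular_values(G)
```

For this pattern $\sigma_1(G) = d\sqrt{m/n}$ as (A1) predicts, and the ratio $\sigma_2(G)/\sqrt{d}$ gives the value of $C$ needed for (A2) on this particular graph.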

III. PERFORMANCE UPPER BOUND


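We now present a performance bound for the solution to
(4). With $f'(x) := df(x)/dx$ and $f''(x) := d^2 f(x)/dx^2$, define

$$\gamma_\alpha \le \min\left\{ \inf_{|x|\le\alpha} \left[ \frac{f'^2(x)}{f^2(x)} - \frac{f''(x)}{f(x)} \right], \ \inf_{|x|\le\alpha} \left[ \frac{f'^2(x)}{(1-f(x))^2} + \frac{f''(x)}{1-f(x)} \right] \right\}, \tag{5}$$

$$L_\alpha \ge \sup_{|x|\le\alpha} \frac{|f'(x)|}{f(x)(1-f(x))}, \tag{6}$$

where $\alpha$ is the bound on the entry-wise infinity-norm of $M$
(see (4)). For the logit model, we have $L_\alpha = 1/\sigma$, and
$\gamma_\alpha = \frac{e^{\alpha/\sigma}}{\sigma^2 (1 + e^{\alpha/\sigma})^2} > 0$. For the probit model we obtain
$L_\alpha \le \frac{4}{\sigma}\left(\frac{\alpha}{\sigma} + 1\right)$, $\gamma_\alpha \ge \frac{\alpha}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{\alpha^2}{2\sigma^2}\right) > 0$. For further
reference, define the constraint set

$$\mathcal{C} := \left\{ M \in \mathbb{R}^{m \times n} : \|M\|_\infty \le \alpha, \ \operatorname{rank}(M) \le r \right\}. \tag{7}$$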
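The quantities in (5) and (6) can be estimated on a grid for any smooth link. The sketch below does this for the logit model and compares the result with the closed forms $L_\alpha = 1/\sigma$ and $\gamma_\alpha = e^{\alpha/\sigma}/(\sigma^2(1+e^{\alpha/\sigma})^2)$. The function names, the grid size, and the values $\alpha = 1$, $\sigma = 0.5$ are illustrative assumptions.

```python
import math

def curvature_constants(f, df, d2f, alpha, grid=2001):
    """Grid estimates of the sup in (6) and of the smaller infimum in (5)."""
    L, gamma = 0.0, math.inf
    for i in range(grid):
        x = -alpha + 2.0 * alpha * i / (grid - 1)      # includes both endpoints
        p, d1, d2 = f(x), df(x), d2f(x)
        L = max(L, abs(d1) / (p * (1.0 - p)))
        gamma = min(gamma,
                    d1 ** 2 / p ** 2 - d2 / p,
                    d1 ** 2 / (1.0 - p) ** 2 + d2 / (1.0 - p))
    return L, gamma

alpha, sigma = 1.0, 0.5                                # illustrative values

# Logistic link and its first two derivatives.
f = lambda x: 1.0 / (1.0 + math.exp(-x / sigma))
df = lambda x: f(x) * (1.0 - f(x)) / sigma
d2f = lambda x: f(x) * (1.0 - f(x)) * (1.0 - 2.0 * f(x)) / sigma ** 2

L_num, gamma_num = curvature_constants(f, df, d2f, alpha)
gamma_closed = math.exp(alpha / sigma) / (sigma ** 2 * (1.0 + math.exp(alpha / sigma)) ** 2)
```

For the logit link both bracketed terms in (5) simplify to $f(x)(1-f(x))/\sigma^2$, which is smallest at $|x| = \alpha$, so the grid minimum is attained at the endpoints.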


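Theorem 3.1: Suppose that $M^* \in \mathcal{C}$, and the graph $G$ associated
with $\Omega$ satisfies assumptions (A1) and (A2), with $m \ge n$. Further, suppose $Y$
is generated according to (1) and $f(x)$ is log-concave in $x$.
Then with probability at least $1 - C_1 \exp(-C_2 m)$, any global
minimizer $\hat{M}$ of (4) satisfies

$$\frac{1}{\sqrt{mn}} \|\hat{M} - M^*\|_F \le \max\left( C_{1\alpha} \frac{r\,\sigma_2(G)}{\sigma_1(G)}, \ C_{2\alpha} \frac{m\sqrt{r^3 n}}{\sigma_1^2(G)} \right) \tag{8}$$

$$\le \max\left( C_{1\alpha} \frac{C r \sqrt{m}}{\sqrt{|\Omega|}}, \ C_{2\alpha} \frac{m^3 \sqrt{r^3 n}}{|\Omega|^2} \right) \tag{9}$$

provided $\gamma_\alpha > 0$. Here, $C_1, C_2 > 0$ are universal constants,
$C > 0$ is given by assumption (A2), and

$$C_{1\alpha} = 4\sqrt{2}\,\alpha, \qquad C_{2\alpha} = 32.16\sqrt{2}\,L_\alpha/\gamma_\alpha,$$

with $\gamma_\alpha$ and $L_\alpha$ given by (5), (6).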

Proof of this theorem is given in Sec. VI. Of particular interest
is the case where $p = |\Omega|/(mn)$ is fixed and we let $m$ and $n$ become
large, with $m/n = \delta \ge 1$ fixed. In this case we have the
following Corollary.

Corollary 3.2: Assume the conditions of Theorem 3.1.
Let $p = |\Omega|/(mn)$ be fixed independent of $m$ and $n$. Then with
probability at least $1 - C_1 \exp(-C_2 m)$, any global minimum
$\hat{M}$ of (4) satisfies

$$\frac{1}{\sqrt{mn}} \|\hat{M} - M^*\|_F \le O\left( \frac{r^{1.5}}{p^2 \sqrt{n}} \right). \tag{10}$$


A. Comparison with previous work 

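Consider $M^* \in \mathbb{R}^{n \times n}$ with $p$ fraction of its entries
sampled, such that $\|M^*\|_\infty \le \alpha$ (also assumed in [4], [5])
and $\operatorname{rank}(M^*) \le r$. Then $m = n$, and $|\Omega| = pn^2$. The bounds
proposed in [4] (and [5] in case of uniform sampling) yield

$$\frac{1}{n^2} \|\hat{M} - M^*\|_F^2 \le O\left( \sqrt{\frac{r}{pn}} \right), \tag{11}$$

whereas, applying our result (Corollary 3.2), we obtain

$$\frac{1}{n^2} \|\hat{M} - M^*\|_F^2 \le O\left( \frac{r^3}{p^4 n} \right). \tag{12}$$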

Comparing (11) and (12), we see our method has a faster
convergence rate in $n$ for fixed rank $r$ and fraction of revealed
entries $p$. Notice that if the fraction of revealed entries scales
with $n$ according to $p \sim O(1/n)$, [4] yields bounded error
while our bound grows with $n$; in our case we need $p$ to be
of order at least $n^{-1/4}$. We believe this to be an artifact of
our proof, as our numerical results (Fig. 1) show our method
outperforms [4], especially for low values of $p$ and higher
values of rank $r$.
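The comparison of the two rates can be tabulated directly. The sketch below assumes the scalings $\sqrt{r/(pn)}$ for (11) and $r^3/(p^4 n)$ for (12), the latter being what squaring the bound of Theorem 3.1 gives when $m = n$, and it drops all constants, so only the trend in $n$ is meaningful. The function names and the values of `r` and `p` are hypothetical.

```python
import math

def prior_rate(n, r, p):
    """Scaling of (11) with constants dropped: sqrt(r / (p n))."""
    return math.sqrt(r / (p * n))

def exact_rank_rate(n, r, p):
    """Scaling of (12) with constants dropped: r^3 / (p^4 n)."""
    return r ** 3 / (p ** 4 * n)

r, p = 2, 0.8                                          # illustrative values
sizes = (10 ** 2, 10 ** 4, 10 ** 6)

# Ratio of the two scalings; it shrinks like 1 / sqrt(n) for fixed r and p.
ratios = [exact_rank_rate(n, r, p) / prior_rate(n, r, p) for n in sizes]

# With p = 1/n the first scaling stays constant while the second grows with n.
sparse_prior = [prior_rate(n, 1, 1.0 / n) for n in (10, 1000)]
sparse_exact = [exact_rank_rate(n, 1, 1.0 / n) for n in (10, 1000)]
```

The ratio decreasing by a factor of ten per hundredfold increase in $n$ reflects the $1/n$ versus $1/\sqrt{n}$ dependence, while the `sparse_*` lists reproduce the remark about $p \sim O(1/n)$.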


IV. OPTIMIZATION 

We will solve the optimization problem (4) using a log-barrier
penalty function approach [9, Sec. 11.2]. The constraint
$\max_{i,j} |M_{ij}| \le \alpha$ translates to the log-barrier penalty function
$-\log\left(1 - (M_{ij}/\alpha)^2\right)$. This leads to the regularized objective
function

$$F^{\lambda}_{\Omega,Y}(M) = F_{\Omega,Y}(M) - \lambda \sum_{(i,j)} \log\left(1 - (M_{ij}/\alpha)^2\right) \tag{13}$$

and the optimization problem

$$\hat{M} = \arg\min_{M} F^{\lambda}_{\Omega,Y}(M) \quad \text{subject to } \operatorname{rank}(M) \le r. \tag{14}$$

We can account for the rank constraint in (14) via the factorization
technique of [10]-[12] where instead of optimizing
with respect to $M$ in (4), $M$ is factorized into two matrices
$U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $M = UV^\top$. One then
chooses $k = r + 1$ and optimizes with respect to the factors
$U$, $V$. The reformulated objective function is then given by

$$F^{\lambda}_{\Omega,Y}(U, V) = F_{\Omega,Y}(UV^\top) - \lambda \sum_{(i,j)} \log\left(1 - (U_{i,\cdot} V_{j,\cdot}^\top/\alpha)^2\right) \tag{15}$$

where $U_{i,\cdot}$ denotes the $i$-th row of $U$, and $V_{j,\cdot}$ the $j$-th row of
$V$. The parameter $\lambda > 0$ sets the accuracy of approximation of
$\max_{i,j} |M_{ij}| \le \alpha$ via the log-barrier function. We solve this
factored version using a gradient descent method with backtracking
line search, in a sequence of central path following
solutions [9, Sec. 11.2], where one gradually reduces $\lambda$ toward
0. Initial values of $U$, $V$ are randomly picked and scaled to
satisfy $\|UV^\top\|_\infty \le 0.95\alpha$. Starting with a large $\lambda_0$, we solve
for $\lambda = \lambda_0, \lambda_0/2, \lambda_0/4, \cdots$ via central path following and use
5-fold cross validation error over $\lambda$ as the stopping criterion
in selecting $\lambda$.
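The procedure just described can be sketched in a few dozen lines. This is a small pure-Python illustration under stated assumptions, not the authors' implementation: it uses the logit link, random placeholder labels instead of data drawn from (1), an Armijo constant of $10^{-4}$, a fixed number of halvings of $\lambda$ in place of the 5-fold cross-validation stopping rule, and hypothetical names throughout.

```python
import math
import random

def logistic(x, sigma):
    return 1.0 / (1.0 + math.exp(-x / sigma))

def product(U, V):
    """M = U V^T for U (m x k) and V (n x k) stored as lists of rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]

def objective(U, V, Y, omega, lam, alpha, sigma):
    """Factored barrier objective; returns inf outside |M_ij| < alpha."""
    total = 0.0
    for i, row in enumerate(product(U, V)):
        for j, x in enumerate(row):
            if abs(x) >= alpha:
                return math.inf
            total -= lam * math.log(1.0 - (x / alpha) ** 2)
            if (i, j) in omega:
                p = logistic(x, sigma)
                total -= math.log(p if Y[i][j] == 1 else 1.0 - p)
    return total

def gradients(U, V, Y, omega, lam, alpha, sigma):
    """Gradients in U and V via the chain rule through M = U V^T."""
    M = product(U, V)
    m, n, k = len(U), len(V), len(U[0])
    D = [[0.0] * n for _ in range(m)]                  # D[i][j] = d objective / d M_ij
    for i in range(m):
        for j in range(n):
            x = M[i][j]
            D[i][j] = lam * (2.0 * x / alpha ** 2) / (1.0 - (x / alpha) ** 2)
            if (i, j) in omega:
                p = logistic(x, sigma)
                D[i][j] += -(1.0 - p) / sigma if Y[i][j] == 1 else p / sigma
    gU = [[sum(D[i][j] * V[j][c] for j in range(n)) for c in range(k)] for i in range(m)]
    gV = [[sum(D[i][j] * U[i][c] for i in range(m)) for c in range(k)] for j in range(n)]
    return gU, gV

def descend(U, V, Y, omega, lam, alpha, sigma, steps=30):
    """Gradient descent with backtracking line search at a fixed lambda."""
    value = objective(U, V, Y, omega, lam, alpha, sigma)
    for _ in range(steps):
        gU, gV = gradients(U, V, Y, omega, lam, alpha, sigma)
        sq_norm = sum(g * g for row in gU + gV for g in row)
        t = 1.0
        while t > 1e-12:
            U_new = [[u - t * g for u, g in zip(ur, gr)] for ur, gr in zip(U, gU)]
            V_new = [[v - t * g for v, g in zip(vr, gr)] for vr, gr in zip(V, gV)]
            new_value = objective(U_new, V_new, Y, omega, lam, alpha, sigma)
            if new_value <= value - 1e-4 * t * sq_norm:   # Armijo condition
                U, V, value = U_new, V_new, new_value
                break
            t *= 0.5
        else:
            break                                      # no acceptable step found
    return U, V, value

rng = random.Random(0)
m, n, r, alpha, sigma = 6, 5, 1, 1.0, 1.0
k = r + 1
omega = {(i, j) for i in range(m) for j in range(n) if rng.random() < 0.6}
Y = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]   # placeholder labels
U = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
peak = max(abs(x) for row in product(U, V) for x in row)
scale = math.sqrt(0.95 * alpha / peak)                 # max |(U V^T)_ij| becomes 0.95 alpha
U = [[scale * u for u in row] for row in U]
V = [[scale * v for v in row] for row in V]

lam = 1.0                                              # lambda_0, illustrative
start = objective(U, V, Y, omega, lam, alpha, sigma)
U, V, after_first = descend(U, V, Y, omega, lam, alpha, sigma)
final_value = after_first
for _ in range(5):                                     # lambda_0/2, lambda_0/4, ...
    lam /= 2.0
    U, V, final_value = descend(U, V, Y, omega, lam, alpha, sigma)
M_hat = product(U, V)
```

Returning infinity for infeasible points lets the line search reject any step that leaves the region $|M_{ij}| < \alpha$, so every iterate stays strictly inside the barrier and can be reused as the starting point for the next, smaller $\lambda$.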


[Figure 1 plot, titled "1 bit matrix completion, probit, n=200"]

Fig. 1: Relative MSE $\|\hat{M} - M^*\|_F^2/\|M^*\|_F^2$ for varied values
of $p$. Probit model, rank $r$, $\sigma = 0.18$, $n = 200$, $\alpha = 1$.
"trace-norm" refers to [4], "max-norm" is the method of [5]
for known $r$.

Remark 4.1: The hard rank constraint results in a nonconvex
constraint set. Thus, (4) and (14) are nonconvex
optimization problems; similarly for minimization of (15) for
which the rank constraint is implicit in the factorization of
$M$. However, the following result is shown in [10, Proposition
4], based on [11], for nonconvex problems of this form. If
$(U^*, V^*)$ is a local minimum of the factorized problem, then
$\hat{M} = U^* V^{*\top}$ is the global minimum of problem (14), so
long as $U^*$ and $V^*$ are rank-deficient. (Rank deficiency is a
sufficient condition, not necessary.) This result is utilized in
[12] and [5] for problems of this form.
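The rank-deficiency condition in Remark 4.1 is easy to test once a local minimum has been found. The sketch below is one hedged way to do so: it estimates the numerical rank by Gaussian elimination with a fixed tolerance, which is a simplification (an SVD-based rank would be the more usual choice), and the factors `U_star` and `V_star` are made-up examples rather than outputs of the algorithm.

```python
def numerical_rank(A, tol=1e-8):
    """Rank of a list-of-rows matrix via Gaussian elimination with partial pivoting."""
    A = [list(map(float, row)) for row in A]
    rows, cols = len(A), len(A[0])
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) <= tol:
            continue                                   # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for i in range(rank + 1, rows):
            factor = A[i][c] / A[rank][c]
            for j in range(c, cols):
                A[i][j] -= factor * A[rank][j]
        rank += 1
    return rank

def is_rank_deficient(F, tol=1e-8):
    """True when the m x k factor F has rank strictly below k."""
    return numerical_rank(F, tol) < len(F[0])

# Hypothetical factors with k = 2 columns.
U_star = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]          # rank 1, hence rank-deficient
V_star = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # rank 2, full column rank

# The sufficient condition of Remark 4.1 asks for both factors to be deficient.
certificate = is_rank_deficient(U_star) and is_rank_deficient(V_star)
```

Choosing $k = r + 1$ is what makes this check meaningful: a solution of rank at most $r$ leaves at least one redundant column in each factor.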


V. NUMERICAL EXPERIMENTS 
A. Synthetic Data 

In this section, we test our method on synthetic data and
compare it with the methods of [4], [5]. We set $m = n$
and construct $M^* \in \mathbb{R}^{n \times n}$ as $M^* = M_1 M_2^\top$ where $M_1$
and $M_2$ are $n \times r$ matrices with i.i.d. entries drawn from a
uniform distribution on $[-0.5, 0.5]$ (as in [4], [5]). Then we
scale $M^*$ to achieve $\|M^*\|_\infty = 1 = \alpha$. We pick $r = 5, 10$, and
vary matrix sizes $n = 100$, $200$, or $400$. We generate the set
$\Omega$ of revealed indices via the Bernoulli sampling model of
[4] with $p$ fraction of revealed entries. We consider the probit
($\sigma = 0.18$, as in [4], [5]) model. For Fig. 1, we take $n = 200$
and vary $p$. The resulting relative mean-square error (MSE)
$\|\hat{M} - M^*\|_F^2/\|M^*\|_F^2$ averaged over 20 Monte Carlo runs
is shown in Fig. 1. As expected, the performance improves
with increasing $p$. For comparison, we have also implemented
the methods of [4], [5], labeled "trace-norm" and "max-norm"
respectively. As we can see, our proposed approach
significantly outperforms [4], [5], especially for low values of
$p$ and high values of $r$.
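The synthetic setup and the error metric can be written down compactly. The following sketch uses only the standard library; the function names, the seed, and the small size $n = 30$ are hypothetical choices made so the example runs quickly, not the settings used for the figures.

```python
import random

def synthetic_low_rank(n, r, rng):
    """M* = M1 M2^T with Uniform[-0.5, 0.5] factors, rescaled so max |M*_ij| = 1."""
    M1 = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(n)]
    M2 = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(n)]
    M = [[sum(a * b for a, b in zip(M1[i], M2[j])) for j in range(n)] for i in range(n)]
    peak = max(abs(x) for row in M for x in row)
    return [[x / peak for x in row] for row in M]

def bernoulli_mask(n, p, rng):
    """Reveal each index independently with probability p."""
    return {(i, j) for i in range(n) for j in range(n) if rng.random() < p}

def relative_mse(M_hat, M_star):
    """Squared Frobenius error divided by the squared Frobenius norm of M*."""
    num = sum((a - b) ** 2 for ra, rb in zip(M_hat, M_star) for a, b in zip(ra, rb))
    den = sum(b ** 2 for rb in M_star for b in rb)
    return num / den

rng = random.Random(1)                                 # illustrative seed
n, r, p = 30, 5, 0.4
M_star = synthetic_low_rank(n, r, rng)
omega = bernoulli_mask(n, p, rng)
zero_estimate = [[0.0] * n for _ in range(n)]
baseline = relative_mse(zero_estimate, M_star)         # error of the all-zeros estimate
```

The all-zeros estimate has relative MSE exactly 1, which gives a reference level against which the curves in Fig. 1 can be read.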



Fig. 2: Log-log plots of relative MSE for varied $n$.

Fig. 3: Relative MSE versus $p$ under the probit model, for
varied $p$, fixed $p + q = 0.7$.

In Fig. 2 we show the relative MSE for $n = 100, 200, 400$,
$p = 0.2, 0.4, 0.6$ for the probit model using our approach. We
also plot the line $1/n$ in Fig. 2 to show the scale of the upper
bound $O\left(\frac{r^3}{p^4 n}\right)$ established in Section III. As we can see,
the empirical estimation errors follow approximately the same
scaling, suggesting that our analysis is tight, up to a constant.




















We additionally plot the MSE for n = 200 and r = 5 in 
Fig. 3, with varying p and keeping p+q = 0.7, under the probit 
model. This enables us to study the performance of the model 
under nonuniform sampling. Note that when p = q = 0.35, 
the spectral gap is largest and MSE is the smallest, and as p 
gets larger, the spectral gap decreases, leading to larger MSE. 


              % Accuracy (Logit Model)
% Training    Proposed    Trace-norm    Max-norm
95            72.3±0.7    72.4±0.6      71.5±0.7
10            60.4±0.6    58.5±0.5      58.4±0.6
5             53.7±0.8    49.2±0.7      50.3±0.2

TABLE I: Accuracy of the proposed, trace-norm [4] and
max-norm [5] approaches on the MovieLens 100k dataset
for different amounts of training data. Accuracy represents
the percentage of test set ratings for which the estimate of
M* accurately predicts the sign, i.e., whether the unobserved
ratings were above or below the average rating.


B. MovieLens (100k) Dataset 

As in [4], we consider the MovieLens (100k) dataset
(http://www.grouplens.org/node/73). This dataset consists of
100,000 movie ratings from 943 users on 1682 movies, with
ratings on a scale from 1 to 5. Following [4], these ratings were
converted to binary observations by comparing each rating
to the average rating for the entire dataset. We used three
splits of the data into training/test subsets and used 20 random
realizations of these splits. The performance is evaluated by
checking to see if the estimate of $M^*$ accurately predicts the
sign of the test set ratings (whether the observed ratings were
above or below the average rating). As in [4], we determine
the needed parameter values by performing a grid search and
selecting the values that lead to the best performance; we
fixed $\alpha = 1$, and varied $\lambda$ (i.e. central path following), $\sigma$ and
rank $r$. Our performance results are shown in Table I using a
logistic model for three approaches: proposed, [4], [5]. These
results support our findings on synthetic data that our method
is preferable to [4], [5] for sparser data.
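The binarization and the accuracy metric used for Table I can be sketched as follows. The toy ratings, the estimate `M_hat`, and the function names are hypothetical, and mapping a rating equal to the mean to $-1$ is an assumption, since the text only distinguishes ratings above and below the average.

```python
def binarize(ratings):
    """Map each rating to +1 if it exceeds the dataset-wide mean, else -1."""
    mean = sum(ratings.values()) / len(ratings)
    # Assumption: a rating exactly equal to the mean is mapped to -1.
    return {key: (1 if value > mean else -1) for key, value in ratings.items()}

def sign_accuracy(M_hat, labels):
    """Percentage of held-out entries whose sign M_hat predicts correctly."""
    hits = sum((1 if M_hat[i][j] > 0 else -1) == y for (i, j), y in labels.items())
    return 100.0 * hits / len(labels)

# Toy data: (user, movie) -> rating on the 1 to 5 scale; the mean is 3.
ratings = {(0, 0): 5, (0, 1): 1, (1, 0): 4, (1, 1): 2}
labels = binarize(ratings)

# Hypothetical estimate of M*; only its signs matter for the metric.
M_hat = [[0.7, -0.2], [-0.1, -0.4]]
accuracy = sign_accuracy(M_hat, labels)
```

In the experiment proper, `labels` would hold only the test-set entries of a given training/test split, and the accuracy would be averaged over the random realizations of that split.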

VI. PROOF OF THEOREM 3.1 

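Our proof is based on a second-order Taylor series expansion
and a matrix concentration inequality.

Let $\theta = \operatorname{vec}(M) \in \mathbb{R}^{mn}$ and $F_{\Omega,Y}(\theta) = F_{\Omega,Y}(M)$. The
objective function $F_{\Omega,Y}(M)$ is continuous in $M$ and the set $\mathcal{C}$
is compact, therefore, $F_{\Omega,Y}(M)$ achieves a minimum in $\mathcal{C}$. If
$\hat{\theta} = \operatorname{vec}(\hat{M})$ minimizes $F_{\Omega,Y}(\theta)$ subject to the constraints,
then $F_{\Omega,Y}(\hat{\theta}) \le F_{\Omega,Y}(\theta^*)$ where $\theta^* = \operatorname{vec}(M^*)$. By the
second-order Taylor's theorem, expanding around $\theta^*$ we have

$$F_{\Omega,Y}(\theta) = F_{\Omega,Y}(\theta^*) + \langle \nabla_\theta F_{\Omega,Y}(\theta^*), \theta - \theta^* \rangle + \frac{1}{2} \left\langle \theta - \theta^*, \left( \nabla^2_{\theta\theta} F_{\Omega,Y}(\tilde{\theta}) \right) (\theta - \theta^*) \right\rangle \tag{16}$$

where $\tilde{\theta} = \theta^* + \gamma(\theta - \theta^*)$ for some $\gamma \in [0,1]$, with
corresponding matrix $\tilde{M} = M^* + \gamma(M - M^*)$. We need several
auxiliary results before we can prove Theorem 3.1.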


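Using (3), it follows that

$$\frac{\partial F_{\Omega,Y}(M)}{\partial M_{\ell k}} = \left( -\frac{f'(M_{\ell k})}{f(M_{\ell k})} \mathbb{I}_{(Y_{\ell k}=1)} + \frac{f'(M_{\ell k})}{1 - f(M_{\ell k})} \mathbb{I}_{(Y_{\ell k}=-1)} \right) \mathbb{I}_{((\ell,k)\in\Omega)}, \tag{17}$$

$$\frac{\partial^2 F_{\Omega,Y}(M)}{\partial M_{\ell k}^2} = \left( \left[ \frac{f'^2(M_{\ell k})}{f^2(M_{\ell k})} - \frac{f''(M_{\ell k})}{f(M_{\ell k})} \right] \mathbb{I}_{(Y_{\ell k}=1)} + \left[ \frac{f''(M_{\ell k})}{1 - f(M_{\ell k})} + \frac{f'^2(M_{\ell k})}{(1 - f(M_{\ell k}))^2} \right] \mathbb{I}_{(Y_{\ell k}=-1)} \right) \mathbb{I}_{((\ell,k)\in\Omega)}, \tag{18}$$

and

$$\frac{\partial^2 F_{\Omega,Y}(M)}{\partial M_{\ell_1 k_1} \partial M_{\ell_2 k_2}} = 0 \quad \text{if } (\ell_1, k_1) \ne (\ell_2, k_2). \tag{19}$$

Let $w = \operatorname{vec}(M - M^*) = \theta - \theta^*$. Note that by our notation,

$$\nabla_\theta F_{\Omega,Y}(\theta^*) = \operatorname{vec}\left( \frac{\partial F_{\Omega,Y}(M^*)}{\partial M_{\ell k}} \right).$$

We then have

$$\langle \nabla_\theta F_{\Omega,Y}(\theta^*), w \rangle = \langle \nabla_M F_{\Omega,Y}(M^*), M - M^* \rangle \tag{20}$$

where $\langle A, B \rangle := \operatorname{tr}(A^\top B)$. Let $Z = \nabla_M F_{\Omega,Y}(M^*)$. Therefore,

$$Z_{ij} = \left( -\frac{f'(M^*_{ij})}{f(M^*_{ij})} \mathbb{I}_{(Y_{ij}=1)} + \frac{f'(M^*_{ij})}{1 - f(M^*_{ij})} \mathbb{I}_{(Y_{ij}=-1)} \right) \mathbb{I}_{((i,j)\in\Omega)}.$$

Using (1) and (6), we have

$$\mathbb{E}[Z_{ij}] = 0, \qquad |Z_{ij}| \le L_\alpha, \qquad \mathbb{E}[Z_{ij}^2] \le L_\alpha^2. \tag{21}$$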
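The three properties in (21) follow in a couple of lines from the sampling model (1) and the definition (6). The derivation below spells this out for an observed index $(i,j) \in \Omega$ (for unobserved indices $Z_{ij} = 0$); it uses only symbols already defined in the text, and the second-moment bound follows directly from the bound on $|Z_{ij}|$.

```latex
\begin{align*}
\mathbb{E}[Z_{ij}]
  &= -\frac{f'(M^*_{ij})}{f(M^*_{ij})}\, f(M^*_{ij})
     + \frac{f'(M^*_{ij})}{1-f(M^*_{ij})}\,\bigl(1-f(M^*_{ij})\bigr) = 0, \\
|Z_{ij}|
  &\le \max\left\{ \frac{|f'(M^*_{ij})|}{f(M^*_{ij})},\;
                   \frac{|f'(M^*_{ij})|}{1-f(M^*_{ij})} \right\}
   \le \frac{|f'(M^*_{ij})|}{f(M^*_{ij})\,\bigl(1-f(M^*_{ij})\bigr)}
   \le L_\alpha .
\end{align*}
```

The last inequality uses $|M^*_{ij}| \le \alpha$, which is where the constraint $M^* \in \mathcal{C}$ enters.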


We need the following result from [13] concerning spectral
norms of random matrices for Lemma 6.2.

Lemma 6.1: [13, Theorem 8.4] Take any two numbers
$m$ and $n$ such that $1 \le n \le m$. Suppose that
$A = [A_{ij}]_{1 \le i \le m, 1 \le j \le n}$ is a matrix whose entries are independent
random variables that satisfy, for some $\sigma^2 \in [0,1]$,

$$\mathbb{E}[A_{ij}] = 0, \quad \mathbb{E}[A_{ij}^2] \le \sigma^2, \quad \text{and } |A_{ij}| \le 1 \text{ a.s.}$$

Suppose that $\sigma^2 \ge m^{-1+\epsilon}$ for some $\epsilon > 0$. Then

$$P\left( \|A\|_2 \ge 2.01\,\sigma\sqrt{m} \right) \le C_1(\epsilon) \exp(-C_2 \sigma^2 m),$$

where $C_1(\epsilon)$ is a constant that depends only on $\epsilon$ and $C_2$ is
a positive universal constant. The same result is true when
$m = n$ and $A$ is symmetric or skew-symmetric, with independent
entries on and above the diagonal, all other assumptions
remaining the same. Lastly, all results remain true if the assumption
$\sigma^2 \ge m^{-1+\epsilon}$ is changed to $\sigma^2 \ge m^{-1}(\log(m))^{6+\epsilon}$.

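Lemma 6.2: Let $w = \operatorname{vec}(M - M^*) = \theta - \theta^*$,
and $M, M^* \in \mathcal{C}$. Then with probability at least
$1 - C_1(\epsilon) \exp(-C_2 m)$, we have

$$\left| \langle \nabla_\theta F_{\Omega,Y}(\theta^*), w \rangle \right| \le 2.01\, L_\alpha \sqrt{2rm}\, \|M - M^*\|_F,$$

where $\epsilon \in (0,1)$, $C_1(\epsilon)$ is a constant that depends only on $\epsilon$
and $C_2$ is a positive universal constant.

Proof: Using (20), we have

$$\left| \langle \nabla_\theta F_{\Omega,Y}(\theta^*), w \rangle \right| \le \|\nabla_M F_{\Omega,Y}(M^*)\|_2 \, \|M^* - M\|_*. \tag{22}$$

Let $Z = \frac{1}{L_\alpha} \nabla_M F_{\Omega,Y}(M^*)$. Then we have $\mathbb{E}[Z_{ij}] = 0$,
$|Z_{ij}| \le 1$ and $\mathbb{E}[Z_{ij}^2] \le 1$. Applying Lemma 6.1 to $Z$
with $\sigma = 1$, we obtain $\|Z\|_2 \le 2.01\sqrt{m}$ with probability
at least $1 - C_1(\epsilon) \exp(-C_2 m)$ for some positive constants
$C_1(\epsilon)$ and $C_2$. We note that for any matrix $A$ of rank $r$,
$\|A\|_* \le \sqrt{r}\,\|A\|_F$, with $\|A\|_*$ denoting the nuclear norm.
Hence $\|M^* - M\|_* \le \sqrt{2r}\,\|M^* - M\|_F$, yielding the desired
result. ■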
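The nuclear-norm inequality invoked in this proof is a direct consequence of the Cauchy-Schwarz inequality applied to the singular values. The short derivation below makes the step explicit, writing $\sigma_i(A)$ for the nonzero singular values of a rank-$r$ matrix $A$ (the notation $\sigma_i(A)$ is introduced here only for this illustration).

```latex
\[
\|A\|_* \;=\; \sum_{i=1}^{r} \sigma_i(A)
        \;\le\; \sqrt{r}\,\Bigl(\sum_{i=1}^{r} \sigma_i^2(A)\Bigr)^{1/2}
        \;=\; \sqrt{r}\,\|A\|_F .
\]
```

Since $M$ and $M^*$ each have rank at most $r$, their difference has rank at most $2r$, which is where the factor $\sqrt{2r}$ comes from.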


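Lemma 6.3: Let $w = \operatorname{vec}(M - M^*) = \theta - \theta^*$ and
$M, M^* \in \mathcal{C}$. Then for any $\tilde{\theta} = \theta^* + \gamma(\theta - \theta^*)$ and any
$\gamma \in [0,1]$, we have

$$\left\langle w, \left( \nabla^2_{\theta\theta} F_{\Omega,Y}(\tilde{\theta}) \right) w \right\rangle \ge \gamma_\alpha \|(M - M^*)_\Omega\|_F^2.$$

Proof: Using (5), (18) and (19), we have

$$\left\langle w, \nabla^2_{\theta\theta} F_{\Omega,Y}(\tilde{\theta})\, w \right\rangle = \sum_{(i,j)\in\Omega} \left( \frac{\partial^2 F_{\Omega,Y}(\tilde{M})}{\partial M_{ij}^2} \right) (M_{ij} - M^*_{ij})^2 \ge \gamma_\alpha \sum_{(i,j)\in\Omega} (M_{ij} - M^*_{ij})^2 = \gamma_\alpha \|(M - M^*)_\Omega\|_F^2, \tag{23}$$

which completes the proof. ■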

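We need a result similar to [7, Theorem 4.1] regarding
closeness of a fixed matrix to its sampled version, which is
proved therein for square matrices $M^*$ under an incoherence
assumption on $M^*$. In Lemma 6.4 we prove a similar result
for rectangular $Z$ with bounded $\|Z\|_\infty$. Define

$$\|Z\|_{\max} = \inf\left\{ \max\left( \|U\|_{2,\infty}^2, \|V\|_{2,\infty}^2 \right) : Z = UV^\top \right\},$$

where for a matrix $A$, $\|A\|_{2,\infty}$ denotes the largest $\ell_2$ norm of
the rows in $A$, i.e., $\|A\|_{2,\infty} = \max_i \|A_{i,\cdot}\|_2$.

For $Z \in \mathbb{R}^{m \times n}$, $m \ge n$, define the operator $\mathcal{R}_\Omega$ as

$$Z_\Omega = \mathcal{R}_\Omega(Z), \qquad [\mathcal{R}_\Omega(Z)]_{ij} = \begin{cases} Z_{ij} & \text{if } (i,j) \in \Omega, \\ 0 & \text{otherwise.} \end{cases}$$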

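Lemma 6.4: Let the graph $G$ associated with $\Omega$ satisfy assumptions
(A1) and (A2) in Section II. Let $Z \in \mathbb{R}^{m \times n}$ with $\operatorname{rank}(Z) \le r$. Then we
have

$$\left\| \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right\|_2 \le \frac{\sqrt{mn}\,\sigma_2(G)}{\sigma_1(G)} \|Z\|_{\max} \tag{24}$$

$$\le \frac{\sqrt{rmn}\,\sigma_2(G)}{\sigma_1(G)} \|Z\|_\infty \le C m \sqrt{\frac{rn}{|\Omega|}}\, \|Z\|_\infty. \tag{25}$$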

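Proof: By definition of $\|Z\|_{\max}$, there exist $U \in \mathbb{R}^{m \times k}$
and $V \in \mathbb{R}^{n \times k}$ for some $1 \le k \le \min(m, n)$ such that $Z = UV^\top$,
$\|U\|_{2,\infty}^2 \le \|Z\|_{\max}$ and $\|V\|_{2,\infty}^2 \le \|Z\|_{\max}$. Since
$\operatorname{rank}(Z) \le r$, we have $k \le r$, but this fact is not needed in
our proof. By the variational definition of operator norm,

$$\left\| \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right\|_2 = \max_{x, y:\ \|x\|_2 = 1,\ \|y\|_2 = 1} y^\top \left( \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right) x.$$

We also have $\mathcal{R}_\Omega(Z) = Z \circ G$ where $\circ$ denotes the Hadamard
(elementwise) product. Letting $U_{\cdot,\ell}$ and $V_{\cdot,\ell}$ respectively denote
the $\ell$-th column of $U$ and $V$, we write

$$Z = \sum_{\ell=1}^{k} U_{\cdot,\ell} V_{\cdot,\ell}^\top.$$

We therefore have

$$y^\top \left( \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right) x = \sum_{\ell=1}^{k} \left[ \frac{\sqrt{mn}}{\sigma_1(G)} (y \circ U_{\cdot,\ell})^\top G\, (x \circ V_{\cdot,\ell}) - (y^\top U_{\cdot,\ell})(x^\top V_{\cdot,\ell}) \right].$$

Normalize $1_m$ to unit norm as $\bar{1}_m = 1_m/\sqrt{m}$, and similarly
for $\bar{1}_n$. Let $y \circ U_{\cdot,\ell} = \alpha_\ell \bar{1}_m + \beta_\ell \bar{1}_{m\perp}^{(\ell)}$ where $\bar{1}_{m\perp}^{(\ell)}$ is a unit
norm vector orthogonal to $\bar{1}_m$. Then $\alpha_\ell = \bar{1}_m^\top (y \circ U_{\cdot,\ell}) = y^\top U_{\cdot,\ell}/\sqrt{m}$. Hence

$$y^\top \left( \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right) x = \sum_{\ell=1}^{k} \left[ \frac{\sqrt{mn}}{\sigma_1(G)} \left( \frac{y^\top U_{\cdot,\ell}}{\sqrt{m}}\, \bar{1}_m^\top G\, (x \circ V_{\cdot,\ell}) + \beta_\ell\, \bar{1}_{m\perp}^{(\ell)\top} G\, (x \circ V_{\cdot,\ell}) \right) - (y^\top U_{\cdot,\ell})(x^\top V_{\cdot,\ell}) \right]$$

$$= \sum_{\ell=1}^{k} \frac{\sqrt{mn}}{\sigma_1(G)}\, \beta_\ell\, \bar{1}_{m\perp}^{(\ell)\top} G\, (x \circ V_{\cdot,\ell}), \tag{26}$$

where we used the facts that $\bar{1}_m^\top G = \sigma_1(G)\, \bar{1}_n^\top$ and
$\bar{1}_n^\top (x \circ V_{\cdot,\ell}) = x^\top V_{\cdot,\ell}/\sqrt{n}$. Since $\bar{1}_m$ is the top left singular vector
of $G$, we have

$$\left| \bar{1}_{m\perp}^{(\ell)\top} G z \right| \le \sigma_2(G) \|z\|_2 \quad \text{for any } z \in \mathbb{R}^n.$$

Using the above inequality in (26) we obtain

$$y^\top \left( \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right) x \le \frac{\sqrt{mn}}{\sigma_1(G)}\, \sigma_2(G) \sum_{\ell=1}^{k} |\beta_\ell|\, \|x \circ V_{\cdot,\ell}\|_2 \le \frac{\sqrt{mn}}{\sigma_1(G)}\, \sigma_2(G) \sqrt{\sum_{\ell=1}^{k} \beta_\ell^2}\, \sqrt{\sum_{\ell=1}^{k} \|x \circ V_{\cdot,\ell}\|_2^2}. \tag{27}$$

We have $\beta_\ell = \bar{1}_{m\perp}^{(\ell)\top} (y \circ U_{\cdot,\ell})$. Hence, $|\beta_\ell| \le \|y \circ U_{\cdot,\ell}\|_2$.
Therefore,

$$\sum_{\ell=1}^{k} \beta_\ell^2 \le \sum_{\ell=1}^{k} \|y \circ U_{\cdot,\ell}\|_2^2 = \sum_{\ell=1}^{k} \sum_{i=1}^{m} y_i^2 U_{i\ell}^2 = \sum_{i=1}^{m} y_i^2 \|U_{i,\cdot}\|_2^2 \le \|U\|_{2,\infty}^2 \sum_{i=1}^{m} y_i^2 \le \|Z\|_{\max}, \tag{28}$$

where we used $\sum_i y_i^2 = 1$. Similarly, we have

$$\sum_{\ell=1}^{k} \|x \circ V_{\cdot,\ell}\|_2^2 = \sum_{\ell=1}^{k} \sum_{j=1}^{n} x_j^2 V_{j\ell}^2 = \sum_{j=1}^{n} x_j^2 \|V_{j,\cdot}\|_2^2 \le \|V\|_{2,\infty}^2 \sum_{j=1}^{n} x_j^2 \le \|Z\|_{\max}. \tag{29}$$

It then follows from (27)-(29) that

$$y^\top \left( \frac{\sqrt{mn}}{\sigma_1(G)} \mathcal{R}_\Omega(Z) - Z \right) x \le \frac{\sqrt{mn}\,\sigma_2(G)}{\sigma_1(G)} \|Z\|_{\max}. \tag{30}$$

This establishes (24). Now use $\|Z\|_{\max} \le \sqrt{r}\,\|Z\|_\infty$ [5] and
$|\Omega| = md$ to establish (25). ■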

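Lemma 6.5: Let $M, M^* \in \mathcal{C}$. Then we have

$$\|(M - M^*)_\Omega\|_F \ge \frac{\sigma_1(G)}{\sqrt{2rmn}} \|M - M^*\|_F - 2\alpha\sqrt{r}\,\sigma_2(G).$$

Proof: Let $Z = M - M^*$, $a = \sqrt{mn}/\sigma_1(G)$, and
$b = (\sigma_2(G)/\sigma_1(G))\sqrt{rmn}$. Then by Lemma 6.4 and the fact that
$\operatorname{rank}(Z) \le \operatorname{rank}(M) + \operatorname{rank}(M^*) \le 2r$, we have

$$\left| a\|Z_\Omega\|_2 - \|Z\|_2 \right| \le \|aZ_\Omega - Z\|_2 \le b\|Z\|_\infty. \tag{31}$$

Using $\|Z\|_\infty = \|M - M^*\|_\infty \le \|M\|_\infty + \|M^*\|_\infty \le 2\alpha$,
(31) can be expressed as $\|Z\|_2 \le a\|Z_\Omega\|_2 + 2\alpha b$. Since
$\|A\|_2 \le \|A\|_F$ $\forall A$, we then have $\|Z\|_2 \le a\|Z_\Omega\|_F + 2\alpha b$.
Since $\|A\|_F \le \sqrt{\operatorname{rank}(A)}\,\|A\|_2$ $\forall A$, we have
$\|Z\|_F \le \sqrt{2r}\,\|Z\|_2 \le \sqrt{2r}\,a\|Z_\Omega\|_F + 2\sqrt{2r}\,\alpha b$, leading to the desired
result. ■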

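Proof of Theorem 3.1: Consider $F_{\Omega,Y}(\theta) = F_{\Omega,Y}(M)$.
The objective function $F_{\Omega,Y}(M)$ is continuous in $M$ and the
set $\mathcal{C}$ is compact, therefore, $F_{\Omega,Y}(M)$ achieves a minimum
in $\mathcal{C}$. Now suppose that $\hat{M} \in \mathcal{C}$ minimizes $F_{\Omega,Y}(M)$. Then
$F_{\Omega,Y}(\hat{M}) \le F_{\Omega,Y}(M)$ $\forall M \in \mathcal{C}$, including $M = M^*$. Define

$$c_g = 2.01\, L_\alpha \sqrt{2rm}, \qquad c_h = \frac{\gamma_\alpha\, \sigma_1^2(G)}{16\, rmn}. \tag{32}$$

Using (16) and Lemmas 6.2 and 6.3, we have w.h.p. (specified
in Lemma 6.2)

$$F_{\Omega,Y}(M) \ge F_{\Omega,Y}(M^*) - c_g \|M - M^*\|_F + \frac{\gamma_\alpha}{2} \|(M - M^*)_\Omega\|_F^2.$$

Since $\hat{M}$ minimizes $F_{\Omega,Y}(M)$, we have

$$0 \ge F_{\Omega,Y}(\hat{M}) - F_{\Omega,Y}(M^*) \ge -c_g \|\hat{M} - M^*\|_F + \frac{\gamma_\alpha}{2} \|(\hat{M} - M^*)_\Omega\|_F^2. \tag{33}$$

Set $\eta = 2\alpha r\, (\sigma_2(G)/\sigma_1(G)) \sqrt{2mn}$ and $\mu_0 = \sigma_1(G)/\sqrt{2rmn}$.
Then Lemma 6.5 implies $\|(\hat{M} - M^*)_\Omega\|_F \ge \mu_0 \left[ \|\hat{M} - M^*\|_F - \eta \right]$.
Now consider two cases: (i) $\|\hat{M} - M^*\|_F \le 2\eta$,
(ii) $\|\hat{M} - M^*\|_F > 2\eta$. In case
(i), we clearly have an obvious upper bound on $\|\hat{M} - M^*\|_F$.
Turning to case (ii), we have

$$\|\hat{M} - M^*\|_F - \eta > \|\hat{M} - M^*\|_F - \frac{1}{2} \|\hat{M} - M^*\|_F = \frac{1}{2} \|\hat{M} - M^*\|_F. \tag{34}$$

Using (33), (34) and Lemma 6.5 with $M = \hat{M}$, we have

$$0 \ge F_{\Omega,Y}(\hat{M}) - F_{\Omega,Y}(M^*) \ge -c_g \|\hat{M} - M^*\|_F + c_h \|\hat{M} - M^*\|_F^2 = \|\hat{M} - M^*\|_F \left[ -c_g + c_h \|\hat{M} - M^*\|_F \right]. \tag{35}$$

In order for (35) to be true, we must have $\|\hat{M} - M^*\|_F \le c_g/c_h$;
otherwise the right-side of (35) is positive, violating (35).
Combining the two cases, we obtain

$$\|\hat{M} - M^*\|_F \le \max\left( 2\eta, \frac{c_g}{c_h} \right) = \max\left( 4\sqrt{2}\,\alpha\, r \sqrt{mn}\, \frac{\sigma_2(G)}{\sigma_1(G)}, \ \frac{32.16\sqrt{2}\, L_\alpha (rm)^{1.5} n}{\gamma_\alpha\, \sigma_1^2(G)} \right). \tag{36}$$

This is the bound stated in (8) of the theorem after division
by $\sqrt{mn}$. The high probability stated in the theorem follows
from Lemma 6.2 after setting $\epsilon = 0.5$. Finally, we use
$\sigma_2(G)/\sigma_1(G) \le C/\sqrt{d} = C\sqrt{m}/\sqrt{|\Omega|}$ and $1/\sigma_1^2(G) \le 1/d^2 = m^2/|\Omega|^2$
to derive (9). ■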
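The constants $C_{1\alpha}$ and $C_{2\alpha}$ of Theorem 3.1 can be traced through the last step by direct arithmetic. The computation below assumes the definitions $c_g = 2.01\,L_\alpha\sqrt{2rm}$, $c_h = \gamma_\alpha\sigma_1^2(G)/(16\,rmn)$ and $\eta = 2\alpha r\,(\sigma_2(G)/\sigma_1(G))\sqrt{2mn}$ as read from (32) and the text preceding (34), and divides each term of (36) by $\sqrt{mn}$.

```latex
\begin{align*}
\frac{2\eta}{\sqrt{mn}}
  &= \frac{4\alpha r\sqrt{2mn}}{\sqrt{mn}}\,\frac{\sigma_2(G)}{\sigma_1(G)}
   = 4\sqrt{2}\,\alpha\; r\,\frac{\sigma_2(G)}{\sigma_1(G)}, \\
\frac{1}{\sqrt{mn}}\,\frac{c_g}{c_h}
  &= \frac{2.01\,L_\alpha\sqrt{2rm}\cdot 16\,rmn}{\gamma_\alpha\,\sigma_1^2(G)\,\sqrt{mn}}
   = \frac{32.16\sqrt{2}\,L_\alpha}{\gamma_\alpha}\;
     \frac{m\,r^{1.5}\sqrt{n}}{\sigma_1^2(G)} .
\end{align*}
```

The leading factors are exactly $C_{1\alpha} = 4\sqrt{2}\,\alpha$ and $C_{2\alpha} = 32.16\sqrt{2}\,L_\alpha/\gamma_\alpha$, and the remaining factors match the two terms inside the maximum in (8).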